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In this paper an inverse preconditioner for the numerical solution of an elliptic Laplace prob- 
lem of a global circulation ocean model is presented. The inverse preconditiong technique is 
adopted in order to efficiently compute the numerical solution of the elliptic kernel by using 
the Conjugate Gradient (CO) method. We show how the performance and the rate of conver- 
gence of the solver are linked to the discretized grid resolution and to the Laplace coefficients 
of the oceanic model. Finally, we describe an easy-to-implement version of the solver on the 
Graphical Processing Units (GPUs) by means of scientific computing libraries and we discuss 
its performance. 
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1. Introduction 

The ocean is now well known to play a dominant role in the climate system because 
it can initiate and amplify climate change on many different time scales. Hence, 
the simulation of ocean model is became a relevant but highly complex task and 
it involves an intricate interaction of theoretical insight, data handling and numer- 
ical modelling. Over the past several years, ocean numerical models have become 
quite realistic as a result of improved methods, faster computers, and global data 
sets. Models now treat basin-scale to global domains while retaining the fine spa- 
tial scales that are important for modelling the transport of heat, salt, and other 
properties over vast distances. Currently, there are many models and methods em- 
ployed in the rapidly advancing field of numerical ocean circulation modelling as 
Nemo, Hops, MOM , POP et al (see [2] for a nice review). However, several of these 
numerical models are not yet optimized by using scientific computing libraries and 
"ad hoc" preconditioning techniques. In all these frameworks the numerical kernel 
is represented by the discretization of the Navier-Stokes equations [4] on a three 
dimensional grid and by the computation of the evolution time of each variable 
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for each grid point. The high resolution computational grid requires efficient pre- 
conditiong techniques for improving the accuracy in the computed solution and 
parallelization strategies for answering to the huge amount of computational de- 
mand. 

In this paper we propose a new solver based on preconditioned conjugate gradient 
(PCG) method with an approximate inverse preconditioner AINV [7] for the nu- 
merical solution of the elliptic sca-surface equation in NEMO-OPA ocean model [3], 
a state of the art modelling framework in the oceanographic research. The PCG is 
a widely used iterative method for solving linear systems with symmetric, positive 
definite matrix and it has proven its efficiency and robustness in a lot of appli- 
cations. The preconditioning is often a bottleneck in solving the linear systems 
efficiently and it is well established that a suitable preconditioner increases the 
performance of an application dramatically. 

The elliptic sea-surface equation is originally solved in NEMO-OPA by means of 
the PCG with diagonal preconditioner, and in our work we prove that it is ineffi- 
cient and inaccurate. We build a new inverse preconditioner and we implement the 
PCG on a Graphic Processor Unit (GPU) by means of the linear algebra Scientific 
Computing libraries. The GPUs arc massively parallel architectures that efficiently 
work with the linear algebra operations and give impressive performance improve- 
ments. They require a deep understanding of the underlying computing architecture 
and the programming with these devices involves a massive re-thinking of existing 
CPU based applications. A challenge is how to optimally use the GPU hardware 
adopting adequate programming techniques, models, languages and tools. In this 
paper, we present an easy-to-implement version of the elliptic solver with the sci- 
entific computing libraries on Compute Unified Device Architecture (CUDA) [15]. 
We implement a code by using CUDA based supported libraries CUBLAS [16] and 
CUSPARSE [17] for the sparse linear algebra and the graph computations. The 
library GPU based approach allows a short code development times and an easy 
to use GPUs implementations that can fruitfully speed up the expensive numerical 
kernel of an oceanographic simulator. The paper is set out as follows. In section 2 
we briefly review the mathematical model: elliptic equations that are at the heart 
of the model. In section 3, the preconditioned conjugate gradient method used to 
invert the elliptic equations are described. In section 4, we outline a implementation 
strategy for solving the elliptic solver by using standard hbraries and in section 5 
we discuss the mapping of our algorithm onto a massively parallel machine. Finally, 
the conclusions are drawn. 



2. The Mathematical Model 

Building and running ocean models able to simulate the world of global circulation 
with great realism require a variety of scientific skills. In modelling the general 
ocean circulation it is necessary to solve problems of elliptic nature. These prob- 
lems are difficult to solve, with the following issues causing the most trouble in 

practice [2]: 

(1) In simulations with complicated geometry (e.g., multiple islands), topogra- 
phy, time varying surface forcing, and many space-time scales of variability 
(i.e., the World Ocean), achieving a good first guess for the iterative elliptic 
solver is often quite difficult to achieve. This makes it difficult for elliptic 
solvers to converge to a solution within a reasonable number of iterations. 
For this reason, many climate modellers limit the number of elliptic solver 
iterations used, even if the solver has not converged. This approach is very 
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unsatisfying. 

(2) Many elliptic solvers with their associated non-local and time dependent 
boundary conditions (be they Neumann or Dirichlet) do not project well 
onto parallel distributed computers, which acts to hinder their scaling prop- 
erties [23-25]. 

In the OPA-NEMO numerical code the primitive equations are discretized within 
sea-surface hypothesis [1] and the model is charecterized by the three-dimensional 
distribution of currents, potential temperature, salinity, preassure and density [4]. 
The numerical method OPA-NEMO is grounded on discretizing of the primitive 
equation - by the use of finite differences on a three dimensional grid - and com- 
puting the time evolution to each variable " ocean" at each grid point for the entire 
globe [6]. A sketch of the OPA-NEMO computational model, see Figure 1, shows 
the complex dynamic processes that mimic the ocean circulation model, composed 
by steps that are many time simulated. 
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Figure 1. NEMO-OPA model. 



The kernel algorithm (highlighted with red color, equation 1) solves the sea-surface 
hight equation r/ The elliptic kernel is discretized with a semi-discrete equations, 
as following: 



ry"+i = r;"-i - (1) 

2AtgTcAhD''+^ = D''+^ - D""-^ + 2AtgAi,'n" (2) 

A, - V[(i7 + 77")V]. (3) 

where r/", n G N is the sea-surface height at the n— th step of the model, which 
describes the shape of the air-sea interface, -D" is the centered difference approx- 
imation of the first time derivative of tj, At is the time stepping, g is the gravity 
constant, Tc is a physical parameter. Ah is the horizontal Laplacian operator and 
H is the depth of the ocean bottom [3]. Whereas the domain of the ocean models is 
the Earth sphere (or part of it) the model uses the geographical coordinates system 
(A, (p, r) in which a position is defined by the latitude (/>, the longitude A and the 
distance from the center of the earth r = a + z{k) where a is the Earth's radius and 
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z the altitude above a reference sea level. The local deformation of the curvilinear 
geographical coordinate system is given by ei,e2 and 63: 



ei = rcos(j), 62 = r, 63 = 1. 

The Laplacian Operator in spherical coordinates A/i£)"+^ in (2) becomes: 

1 



AhD 



n+l 



6162 



(4) 



(5) 



where: 



a{4>) = {H + rfY^lex (6) 
/?(</>) = (i/ + 77")ei/e2 (7) 
For the functions in (6) and ^(<^) in (7), we have the following relations: 



lim q;((/)) = +00 A lim /3(0) =0 
— >-±5 4> — ^-if 



(8) 



From (8), if we choose M, e G R with M » e then exists an interval [| — 5, |] or 
[ — |, — f + S], such that the following inequality holds: 



a((/)) > M » e > 



(9) 



In physical terms, in the proximity of the geographical poles, {X, ±(j)/2,r), there 

are several orders of magnitude between the functions a{(f)) and 

The result (9), will significantly influence the rate of convergence in the iterative 

solver. 



3. Inverse Preconditioned Techniques in the Elliptic Solver of the Ocean 
Model 

Let us now consider the elliptic NEMO model [3] defined by the following coeffi- 
cients: 

C.^^f = 2At^H{i,j)e,iiJ)/e2{i,j), Cf^ = 2At^H{i,j)e2ii,j)/e^{i,j) . . 

where 5i and 5j are the discrete derivative operators along the axes i and j. The 
discretization of the equation (2) by means of a five-point finite difference method 
gives: 

Ct'fD,.,, + C^A.-i - {Ciii^ + CfJlr + C^f + CFr)D,,+ ^^^^ 

where the equation (11) is a symmetric system of linear equations. All the elements 
of the sparse matrix A vanish except those of five diagonals. With the natural 
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ordering of the grid points (i.e. from west to east and from south to north), the 
structure of A is a block-tridiagonal with tridiagonal or diagonal blocks. The matrix 
A is a positive-definite symmetric matrix with n = jpi x jpj size, where jpi and jpj 
are respectively the horizontal dimensions of the grid discretization of the domain. 
The Conjugate Gradient Method is a very efficient iterative method for solving 
the linear system (11) and it provides the exact solution in a number of iterations 
equal to the size of the matrix. The convergence rate is faster as the matrix is closer 
to the identity one. By spectral point of view a convergence relation between the 
solution of the linear system and its approximation Xm is given by: 

||x - x„|U < 2(^^-l)'""||x - xolU (12) 

with /U2(A) = \max/^min, where Xmax and ^min are respectively the greatest and 
the lowest eigenvalue of A, and || • \\a is the A-norm. The preconditioning framework 
consists to introduce amatrix M, that is an approximation of A easier to invert, 
and to solve the equivalent linear system: 

M^^Ax = M^^b (13) 

The ocean global model NEMO-OPA uses the diagonal preconditioner, where M 
is chosen to the diagonal of A. Let us introduce the following cardinal coefficients: 



K _ ^ EW 11 W _ EW I, 



(14) 
(15) 



where di^j = {CUfj + C^]^-^ + C[^f + C^j^) represents the diagonal of the matrix 
A. The (11), using the diagonal preconditioner, can be written as: 



(16) 



with bij = -bij/dij. 

Starting from the observations (8) and (9) we proof that the diagonal preconditioner 
does not work very well in some critical physical situations involving curvilinear 
spherical coordinates. 

Proposition 3.1 In the geographical coordinate, if (j) ^ ~^ 0, 

then the conditioning number /x(M~-'^A) goes to +oo. 



Proof. In the geographical coordinate, i.e. when (i, j) — >■ (A, 4)) and (ei , ^2) (r cos r), 
for (j) -> +f AA -> and A(/) 0. the functions and in (14) go to -1/2 while 
fi^ and (3^ in (15) go to 0. Hence the limit of matrix M^-'-A is given by: 



A' = 



1 -1/2 ... 

-1/2 1 -1/2 '■ 

■■ ■■ 

■■. ■■■ ■■■ -1/2 

■■ -1/2 1 



(17) 



The eigenvalues of the matrix in (17) are: 
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/ fcTT \ 

Afc = l + cos k = l,...,n (18) 

\n + 1 J 

and then the condition number /i2(M^^A) = Xmax/^min ~ r? jl (by using the series 
expansion of cos a; = 1 — a;^/2 + o(a;^)). Moreover for A A — )■ and A(/) the size n of 
the matrix A goes to +cx) and hence we obtain the thesis. ■ 

By the proposition (3.1), for n large and (\) —> ±7r/2, it is preferable to adopt more 
suitable preconditioning techniques or a stategy based on the local change of the 
coordinates at poles. In this paper we propose an alternative approximate sparse 
inverse preconditioning AINV techniques [7] for the linear system (11). AINV tech- 
nique is a critical step since the inverse of a sparse matrix is usually dense. The 
problem is how to build a preconditoner that preserves the sparse structure. We 
introduce a factored sparse approximate inverse FSAI preconditioner P = ZZ* 
[8, 11], computed by means conjugate-orthogonalization procedure. Specifically, 
we propose an "ad hoc" method for computing an incomplete factorization of the 
inverse of the matrix T C A, obtained by A taking only the elements a^j- such 
that \j — i\ < 1. The factorized sparse approximate inverse of T is used as explicit 
preconditioner for (11). In the following we give some remarks for the sparsity 
pattern selection S of our inverse preconditioner P. 

Proposition 3.2 If T is a tridiagonal, symmetric and diagonally dominant ma- 
trix, with diagonal elements all positive tk^k > 0, k = l,...,n, then the Cholesky's 
factor U of the matrix T is again diagonally dominant. 

Proof. Since T is a tridiagonal matrix then U is a bidiagonal matrix. Using the inductive 

method wc proof that U is diagonally dominant matrix. For fc = 1 is trivially, indeed by 
hypothesis we know that |ai,i| > |ai,2| *^=^ I'^iil > |wi,iui,2|, then we obtain > 
|mi,2|. Moreover placed the thesis true for A; — 1 i.e. \uk-i,k-i\ > \uk-i,k\ then by the 
following inequalities: 

\a.k,k\ > |afe,fe-i| + |afe,fe+i| 'S=> 

|wfe_i,fc + Wfc.fcl > \uk-i,kUk-i,k-i\ + \uk,kUk,k+i\ > Uk-i,k + \uk,kUk,k+i\- (19) 
subtracting the inequality (19) for uf._-^ j,, then the thesis holds also for k. ■ 
This result allows to prove the following proposition: 

Proposition 3.3 The inverse matrix Z of a bidiagonal and diagonally dominant 
matrix U has column vectors Zk,k = 1, ...n such that starting from diagonal element 
Zk^k, they contain a finite sequence {zk-i^k}i=o,...,k-i strictly decreasing. 

Proof. Applying a backward substitution procedure for solving the system of equations 
Uzfe = efe, we get: 

' l/wfe,fe if i = 

i 

{-iy/Uk,k ■ fJ(Wfe-r,/c-r-+l/Wfc-r,fe-r) ^^^^ 
r=l 

if 0<i<k-l. 

By means of the preposition (3.2) we obtain that Zk-i,k > Zk-i-i,k with < « < A; — 1 
and hence the thesis is proved. ■ 

The previous propositions (3.2) and (3.3) enable to select a sparsity pattern S by 
the following scheme: 



^k—i.k 
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(1) Consider the symmetric, diagonally dominant aind triangular matrix 
T, obtained by A taking only the elements Uij such that \j — i\ < 1 

(2) T ~ U^U is diagonally dominant matrix. Consequently its Cholesky 
factor U is diagonally dominant (proposition (3.2) ). 

(3) U is a bidiagonal and diagonally dominant matrix. Z = has 
columns vector Zfc, k = l,...,n such that Zk-i,k > -2fe-i-i,fe with 
0<i<k-l. (proposition (3.3)) 

(4) Fixed an upper bandwidth q, the entries Zij with j > i + g of Z are 
considered negligible. 

(5) The preconditioner P = ZZ* is built as: 

{Zi.j if j <i + q 
(21) 
if j>i + q 

(6) The sparse factor Z is computed by T-orthogonalization procedure 
posing the sparsity pattern S={(i,j) /j > i + q} 

T is a diagonally dominant matrix then the incomplete inverse factorization of T 
exists for any choice of the sparsity pattern S on Z [8] . 

Prom computationally point of view, the T-orthogonalization procedure with the 

sparsity pattern S is based on matrix-vector operations with computational cost 
of 5(g + 1) floating point operations. Moreover, for each column vector of Z wc 
work only on its g + 1 components Zk[k — q],Zk[k — q + l],...,Zk[k] with consequently 
global complexity of 5q{q + l)O(n). 

4. Practical Considerations 

In this section we give some practical details on the elliptic solver implementation 
with FSAI preconditioner, on a generic GPU architecture. The matrices A, Z and 
Z^ are stored with the special storage format Compressed Sparse Row (CSR). 
The FSAI is performed in serial on the CPU and its building requires a negligible 
time on total execution of the elliptic solver. We show the implementation of the 
Algorithm 1 outlines on the CPUs [12, 13, 20]. 

Algorithm 1 FSAI-PCG solver 
1: A: = 0; xo = -D°j = 2-D*j\ the initial guess; 
2: To = b - Axo; 

3: So = ZZ*ro, with P = ZZ* the FSAI preconditioner; 

4: do = So; 

5: while (||rfe||/||b|| > e .and. k < n) do 

6: Qfe = Adfc; afc = (sfc,rfc)/(dfc,qfc); Xfe+i = Xfe + afedfe; 

7: rfe+i = rfe - afeQfe; s^+i = ZZ*rfc+i; Pk = (sfe+i, rfe+i)/(sfe, r^); 

8: dfe+1 = rfc+1 -t-/3fcdfc; k = k + l; 

9: end while 



In details, our solver is implemented by means of the CUDA language with 
the auxiliary linear algebra libraries CUBLAS, for the "dot product " (xDOT), 
"combined scalar multiplication plus vector addition" (xAXPY), "euclidean norm" 
(xNRM2) and "vector by a constant scahng" (xSCAL) operations, and CUSPARSE 
for the sparse matrix-vector operations in the PCG solver. The linear algebra sci- 
entific libraries are extremely helpful to easily implement a software on the GPU 
architecture. A "by-hand" implementation (see Figure 2 and 3) of the solver with- 
out the library features in reported as a tedious GPU programming example. In this 
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1 /*GRII] and BLOCK CONFIGURATION*/ 

2 const int warpsize = 32; //Number of threads in a Warp. 

3 canst int maxGridSize = 65535; //HaxinLin rumber cf simultaneous Block.s. 

4 int warpCount = (n / warpSize] + (((n \% warpSize) =0) ? fl : 1); //Number ai Warps. 

5 int ttireadCnuntPerBlock = warpSize; //Nuniber of threads per Block. 

6 int blockCoiJiit = rain (maxGridSize .warpCnunt ) ; //Number aX Etccks per kernel. 

7 iLitH BlockDim = diii3(threadCoiirtPerBlocK, 1, 1); //Set Block, dimension. 

8 (Iiin3 GridDim = diii3 [blockCount , 1, 1); //Set Grid dimension. 

Figure 2. Grid and block configuration. 



1 /*MATflIX- VECTOR PRODUCT OF A SPARSE MATRIX IN THE CSR FOmAT*/ 

2 global void spmvcsrscalarkernel (int n, int* start, const int* j, 

3 const float " const float- d, float* yl 

4 { 

5 int i = blockoim.x • blockldx.x + threadldx.x ; 

6 if ( i < n H 

7 int prl=start[i] ; 

8 int pr2=start[i+l] -1; 

9 y[i]=a; 

IS for (int q =prl; q<=pr2; q++} 

11 y[i]+=i-a[q]*d[j[q]]; 

12 } 

m ' 

IS ... 

16 

17 int main(int argc, char* argv[]){ 

18 ... 

19 spmv_csr_scalar_kernel<:isSridDiro,BlockDiri»5-(n,d_start,d_j , d_a ,d_x , d_y); 
2B ... 

21 } 

Figure 3. Naive GPU implementation of a function for the matrix-vector product of a 
matrix in CSR format. 

type of coding, the manually configuration of the grid of thread blocks is necessary. 
For example if we use the TESLA S2050 the variables warpSize and maxGridSize 
(respectively at lines 2 and 3) have to be assigned. In details, warpSize indicates 
the number of threads (32) in a warp, which is a sub-division use in the hardware 
implementation to coalesce memory access and instruction dispatch; maxGridSize 
is the maximum number of simultaneous blocks (65535). Furthermore, warpCount 
(at line 4) represents the number of warps and it depends on the dimension n 
of the problem. In the end, variables threadCountPerBlock and blockCount (re- 
spectively at lines 5 and 6) are the parameter used for setting the grid and block 
configuration (lines 7 and 8). For example, if n = 10000 is the size of a vector, 
then threadCountPerBlock = 32 and blockCount = 313. 

The figure 3 shows the matrix- vector product, with the matrix stored in CSR 
format. We observe that this implementation requires a large amount of kernel 
functions, invoked by the "host" (CPU) and executed on the "device" (GPU). 
In the following, we will show how to implement the Algorithm 1 outlines by using 
library features. In order to use the CUBLAS library it is necessary inizialize it by 
means of the following instructions: 



cublasStatus stat ; 
cublaslnit () ; 



For the use of the CUSPARSE library two steps are necessary. The first one consists, 
as follow, in the library initialization: 



cusparseHandle_t handle=0; 
cusparseCreate (fehandle) ; 



moreover, it is recalled the creation and setup of a matrix descriptor: 
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cusparseMatDescr_t descra=0; 
cusparseCreateMatDescr (fedescra) ; 

cusparseSetMatType(descra,CUSPARSE_MATRIX_TYPE_GENERAL) ; 
cusparseSetMatIndexBase(descra,CUSPARSE_INDEX_BASE_ZERO) ; 



The library avoids to configure the grid of the thread blocks and it allows to write 
codes in a very fast way. For example, at line 6 of the Algorithm 1 the computation 
of qfc = Adfe is required, and this operation can be made simply by calling the 
CUSPARSE routine cusparseScsrmvO , that performs the operation q = aAd+6q 
as follows: 



cusparseScsrmv (handle, CUSPARSE_OPERATION_NON_TRANSPOSE, n, n, 
a, descra. A, start, j, d, b, q); 



In our context, A, j and start represent the symmetric positive-definite matrix 
A, stored in the CSR format. More precisely, the vector A denotes the non-zero 
elements of the matrix A, j is the vector that stores the column indexes of the non- 
zero elements, the vector start denotes, for each row of the matrix, the address 
of the first non-zero clement and n represents the row and columns number of the 
square matrix A. The constants a and h are assigned to 1.0 and 0.0 respectively. 
Moreover it happens that at line 6 of the Algorithm i, the computation of (s^, r^) 
is performed by means the CUBLAS routine for the dot product: 

alfa_num = cublasSdot (n, s, 1NCREMENT_S, r, INCREMENT_R) ; 



The constants INCREMENT_S and INCREMENT _R are both assigned to 1. Last opera- 
tion of line 6 in Algorithm 1, is the ^x-k+i = + Ofcdfc for updating the solution 
and it is implemented by calling the CUBLAS routine for the saxpy operation: 

cublasSaxpyCn, alfa, d, INCREMENT_D, x, INCREMENT_X) ; 



The constants INCREMENT_D and INCREMENTJC are both assigned to 1. In addition, 
the computation of Sk+i = ZZ*rk+i at line 7 is the preconditioning step of the 
linear system (11) and it is computed by means of two matrix-vector operations 
performed as: 



cusparseScsrmv (handle , 


CUSPARSE_ 


OPERATION_NON_TRANSPOSE, n, n. 


a, descra, Z_t 


, start_Z_ 


t , j_Z_t , r , b, zt) ; 


cusparseScsrmv (handle , 


CUSPARSE. 


OPERATION_NON_TRANSPOSE, n, n, 


a, descra, Z_t 


, start_Z, 


j_Z, zt, b, z); 



In details, in the first call of cusparseScsrmvO, zt = Z*rk+i and Sk+i = 
Zzt are computed. We have outlined just few of computational operations be- 
cause the other will be performed in the same way. The parameters handle, 
CUSPARSE_0PERATI0NJJ0N_TRANSP0SE and descra are discussed in the NVIDIA report [17] 
in more detailed way. In summary, we highlight how the use of the standard library, 
designed for the GPU architecture, allow to optimize the computational oceano- 
graphic simulation model. 
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Matrix Name 


Size 


Matrix non-zeros elements 


ORCA-2 


180 X 149 


133800 


ORCA-05 


751 X 510 


1837528 


ORCA-025 


1442 X 1021 


7359366 



Tabic 1. NEMO-OPA grid resolutions. 



A Dimension 


P (f. a 0) 


P-^0=«O) 


P (0 » 7r/2) 


P ^ (<i » 7r/2) 


ORCA-2 


271 


460 


8725 


26820 


ORCA-05 


1128 


1593 


22447 


86280 


ORCA-025 


2458 


3066 


28513 


139742 



Tabic 2. Comparison between P and P ^ in terms of Number of Iterations of the PCG in the case A is well- 
conditioned (0 ^ 0) and A ill-conditioned (0 ^ 7^/2), varying the problem dimensions. 



5. Numerical Experiments 

In this section we focus on the important numerical issues of our elhptic solver 
implemented with GPU architecture in single precision. The solver is tested on 
three grid size resolutions of the NEMO-OPA ocean model (Table 1). 

In the Table 2, we compare the performance in terms of PCG iterations of the 
proposed inverse bandwidth preconditioner P respect to P~^, that is the diagonal 
NEMO-OPA preconditioner. We fix an accuracy of e = 10~^ on the relative residue 
r = 1 1 Ax — b||/| |b| I on the linear system solution. The experiments are carried out 
in the case of well-conditioned A matrix, corresponding to the geographical case 
of (/) ~ and in the case of ill-conditioned A with (p ~ 7r/2. We can observe as in 
the worst case with n large and A ill-conditioned the PCG with has a very 
slow convergence with a huge number of iterations to reach the fixed accuracy. The 
experiment, reported in the Table 2, highlights the poor performance of the P~^ 
for solving the Laplace elliptic problem (11) within OPA-NEMO. 

In the following, we show the performance in terms of PCG iterations of P 
respect to the AINV Bridson Class preconditioners, that believe to CUSP library. 
To be more specific, let us consider the P^i and Pb2 Bridson's preconditioners, 
obtained by means of the ^-orthogonalization method. The first is given by posing 
a (fixed) drop tolerance and by ignoring the elements below the fixed tolerance [9] 
and in the second one is predetermined the number of non-zeros elements on each 
its row. [10]. 




ORCA-2 ORGA-05 ORCA-025 

Matrix Size 



Figure 4. Comparison between P, P^i and 
P B2 in terms of Number of Iterations of the 
PCG (y— axis) when A is well-conditioned, 
varying the problem dimensions (x— axis) 




ORCA-2 ORCA-05 DFtCA-025 

Matrix Size 



Figure 5. Comparison between P, Fbi and 
Pb2 in terms of Number of Iterations of 
the PCG (y— axis) when A is ill-conditioned, 
varying the problem dimensions (x— axis) 



The required accuracy on the solution is fixed to e = 10 on the relative residue. In 
Figure 4 we report the PCG iterations of P, P^i and Pb2 in the case of the matrix 
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Matrix Name 


Non-zeros Elem. 


Mem. Occ. 


ORCA-2 


133800 


4MB 


ORCA-05 


1837528 


37MB 


ORCA-025 


7359366 


135MB 



Tabic 3. Matrix memory occupancy. Mem Occ. is the full memory allocated memory on the GPU. 



A well-conditioned (i;^ ~ 0). In Figure 5 we present the case of the ill-conditioned 
((/) ~ '?i"/2) matrix A. The numerical results show as the number of iterations of 
the solver P is comparable to P^i and Pb2 when the dimensions of the problem 
are small or middle. Furthermore, it is strongly indicated to use P with a huge 
problem dimension. We test P, P""*", P^i and Pb2 on the sparse matrix N0S6 of 
the Market Matrix database [22] by setting the required accuracy on the computed 
solution to e = 10~® and the band q of the preconditioner to 4. This sparse matrix 
is obtained in the Lanczos algorithm with partial re-orthogonalization Finite differ- 
ence approximation to Poisson's equation in an L-shaped region, mixed boundary 
conditions. 

The figure 6 shows as P achieves the best performance in terms of iterations. 
Finally, we test the elliptic solver implementation on GPU architecture. The nu- 
merical experiments are carried out on an "NVIDIA TESLA S2050" card, based 
on the "FERMI GPU". The "TESLA S2050" consists of 4 GPGPUs, each of which 
with 3GB of RAM memory and 448 processing cores working at 1.55 GHz. All 
runs are given on 1 GPU device. We have adopted CUDA release 4.0, provided 
by NVIDIA as a GPGPU environment and the numerical code is implemented by 
using the single precision arithmetic. 

As described in the previous sections by using scientific computing library it is not 
necessary manually setting up the block and grid configuration on the memory de- 
vice. The number of blocks required to store the elliptic solver input data (in GSR 
format) do not have to exceed the maximum sizes of each dimension of a GPU grid 
device. Schematic results of GPU memory utilization for ocean model resolutions 
are presented in the Table 3. Observe that in our numerical experiments we do not 
fill the memory of the TESLA GPU and the simulations run also on cheaper or 
older boards, as for example the Quadro 4700FX. Generally, it is possible to grow 
the grid dimensions of the ocean model according to the memory capacity of the 
available GPU. 

The elliptic solver requires a large amount of Sparse-Matrix Vector 
(cusparseCsrmv) multiplications, vector reductions and other vector operation to 
be performed. CPU version is implemented in ANSI C executed in serial on a 
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2.4GHz "Intel Xeon E5620" CPU, with 12MB of cache memory. Serial and GPU 
versions are in single precision. We test the performance of the solver in terms 
of Floating Point Operations (FLOPS). The performance of the numerical experi- 
ments (reported in the Figure 8) are given in the case of A ill-conditioned matrix 
{(j) ~ '7r/2). We count an average of the iterations of solver and the complexity of all 
linear algebra operations involved in both serial and parallel implementations. The 
"GPU solver" (blue) and "CPU solver" (orange) curves represent the GFLOPS of 
the solver, respectively, for the CPU and GPU versions. 




The main recalled computational kernels in the solver are the Sparse-Matrix Vector. 
From the Figure 7 we highlight the improvement in terms of GFLOPs speed-up 
by replacing gemvO with cusparseCsrmvO function. These results prove that, 
increasing the model grid resolution, it is possible to exploit the computational 
power of the GPUs. In details, the GPU solver implementation in the ORCA-025 
configuration has a peak performance of 87 GFLOPS respect to 1,43 GFLOPS of 
the CPU version. 



6. Conclusions 



The ocean modelling is a challenging application where expensive computational 
kernels are fundamental tools to investigate the physics of the ocean and the cli- 
mate change. In a lot of applications, the elliptic Laplace equations are used in 
the complex mathematical models; they represents critical computational points 
since the convergence of the numerical solvers to a solution, within a reasonable 
number of iterations, it is not always guaranteed. In our case, this happens to the 
preconditioning technique of the OPA-NEMO ocean model, for which we prove 
to be inefficient and inaccurate. In this paper, we have proposed a new inverse 
preconditioner based on the FSAI method that shows better results respect to 
the OPA-NEMO diagonal one and to others of the Bridson class. Moreover, an 
important contribute is given by an innovative approach for parallelizing the el- 
liptic solver on the Graphical Processing Units (GPU) by means of the scientific 
computing libraries. The library based implementation of the computing codes al- 
lows to optimize oceanic framework reducing the simulation times and to develop 
computational solvers easy-to-implement. 
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